More work supporting objects larger than 4GB on Windows#2137
Conversation
On Windows, `unsigned long` and `long` are 32 bits even on 64-bit builds. The MSVC compatibility header has shimmed `ftruncate()` with #define ftruncate _chsize ever since `compat/msvc-posix.h` was introduced. `_chsize()` takes a 32-bit `long` for the new length, which silently truncates files (and the requested size) to 2 GiB. That is enough to make t7508 test 126 "git add fails gracefully with 4 GiB and 8 GiB files" fail under MSVC: `test-tool truncate` creates a sparse 4 GiB or 8 GiB file via the shimmed `ftruncate()`, and the test never gets off the ground. `_chsize_s()` is the modern replacement, accepts a 64-bit `__int64` length, and is the only sensible target on Windows. The catch is that it does not follow the POSIX `-1` + `errno` convention: it returns `0` on success and an errno value (a small positive integer) on failure. A plain `#define ftruncate _chsize_s` would therefore silently break callers that test the return value as `< 0` or against `-1`, of which there are several: `http.c`, `parallel-checkout.c`, and `t/helper/test-truncate.c` among them. Introduce a `static inline` wrapper that calls `_chsize_s()`, copies its errno return into `errno`, and translates the result to the familiar `-1` / `0` convention, then point `ftruncate` at the wrapper. Place the wrapper after `#include "mingw-posix.h"` so the `off_t` parameter resolves to the already-widened `off64_t` rather than the 32-bit `_off_t` from `compat/vcbuild/include/unistd.h`. MinGW is unaffected: its `ftruncate()` already takes `off_t` and routes through `ftruncate64()` when `_FILE_OFFSET_BITS=64`, which is the default in our build. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
`patch_delta()` takes the source and delta sizes by value and writes back the reconstructed target size through an `unsigned long *`. That datatype cannot represent a value that exceeds 4 GiB on systems where `unsigned long` is 32-bit (notably 64-bit Windows builds), though, even though the delta encoding itself, the on-disk layout, and the in-memory buffers happily carry such sizes. A `size_t` companion to `get_delta_hdr_size()`, `get_delta_hdr_size_sz()`, was introduced in 17fa077 (delta, packfile: use size_t for delta header sizes, 2026-05-08) precisely so that `patch_delta()` could be widened without changing the on-the-wire decoding helper's signature. Widen `patch_delta()`'s three size parameters to `size_t` and switch its internal use of `get_delta_hdr_size()` to the `_sz` variant. Then propagate the wider type through the callers. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
`write_reuse_object()` learned to track its packed-object size as `size_t` in 606c192 (odb, packfile: use size_t for streaming object sizes, 2026-05-08), but the comparison sink it feeds, `check_pack_inflate()`, still takes the expected decompressed size as `unsigned long`. The call site bridges the mismatch with `cast_size_t_to_ulong()`, which on Windows turns a >4 GiB object into an immediate die(). That function only uses `expect` once: as the right-hand side of a `stream.total_out == expect` equality test against zlib's counter. zlib's own `total_out` counter is `uLong` and is therefore still 32-bit-bound on Windows. Widening `expect` to `size_t` cannot fix that, but it is a strict improvement nonetheless: instead of dying outright, an oversized object now simply makes the equality fail and lets `write_reuse_object()` fall back to `write_no_reuse_object()`, which decompresses and re-deflates the content (and which the larger pack-objects widening series targets separately). Drop the `cast_size_t_to_ulong()` shim at the call site now that the receiving parameter speaks the same type as `entry_size`. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The topic `js/objects-larger-than-4gb-on-windows` widened the streaming, index-pack and unpack-objects paths to `size_t` but deliberately stopped at the in-memory `unpack_entry()` cascade, which still hands back the unpacked size through `unsigned long *`. On Windows that boundary truncates above 4 GiB because that data type is only 32 bits wide on that platform. Widen the code path. Except `packed_object_info_with_index_pos()`: It cannot yet pass `oi->sizep` directly because the field is still `unsigned long *`; bridge it with a `size_t` temporary that narrows back, and let a later commit drop the bridge once the field is wide too. `gfi_unpack_entry()` keeps its narrow signature because fast-import tracks sizes through `unsigned long` everywhere it crosses subsystem boundaries, keeping its signature allows the scope of this commit to be somewhat reasonable, still. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
`pack-objects` stores per-entry object sizes in either the 31-bit `size_` member of the `struct object_entry` or, when the value does not fit, the `pack->delta_size[]` spill array. The accessors (`oe_size`, `oe_delta_size`, `oe_get_size_slow`, `oe_size_*_than`) and the setters (`oe_set_size`, `oe_set_delta_size`) used `unsigned long` for the spill type, which on Windows means the spill silently caps at 4 GiB per entry. That is what made `upload-pack` die with "object too large to read on this platform" when serving the >4 GiB blob in `t5608` tests 5 and 6 when run with `GIT_TEST_CLONE_2GB`. Widen them all to `size_t` (including `pack->delta_size`) and drop the three `cast_size_t_to_ulong()` calls in `check_object()` that guarded `in_pack_size`. The two `SET_SIZE(entry, canonical_size)` calls in the same function stay cast-free as before, since `canonical_size` is still `unsigned long` until a later commit widens `object_info::sizep`. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
When I started the transition from `unsigned long` to `size_t`, in the interest of keeping the patches reviewable, I introduced these calls to prevent data type narrowing from silently failing to handle large object sizes. I also introduced `*_sz()` variants that would allow most of the callers to keep using that `unsigned long` that the 90s kindly asked to be returned. After the preceding commits, the only places that called the narrow wrappers either no longer exist or already use the `_sz` form internally, so the wrappers just narrow values back through `cast_size_t_to_ulong()` for no reason. Drop them and rename the `_sz` variants back to the natural names. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
When `js/objects-larger-than-4gb-on-windows` widened the streaming, index-pack and unpack-objects code paths, in the interest of keeping the patches somewhat reasonably-sized, it left the public ODB API still typed in `unsigned long`. In particular `struct object_info::sizep` and the four wrappers built on top of it (`odb_read_object`, `odb_read_object_peeled`, `odb_read_object_info`, `odb_pretend_object`) still return the unpacked size through `unsigned long *`, so on Windows `cat-file -s` and the `git add` / `git status` paths for a >4 GiB blob silently cap at 4 GiB. Widen the field and the four wrappers. The previous commits already widened the `unpack_entry()` cascade and pack-objects' in-core size accessors, so most of the cascade arrives here with no further work: the temporary shims in `packed_object_info_with_index_pos()` and in `unpack_entry()`'s delta-base recovery path go away, the two `SET_SIZE(entry, cast_size_t_to_ulong(canonical_size))` calls in `check_object()` and the matching one in `drop_reused_delta()` collapse to plain `SET_SIZE`, and `oe_get_size_slow()`'s tail `cast_size_t_to_ulong()` is gone too. What remains narrow are the boundaries this series does not intend to touch: the diff, blame, textconv and fast-import machinery. Even so, this patch is unfortunately quite large. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2491da9 to
f3aeae9
Compare
|
/submit |
|
Submitted as pull.2137.git.1780570272.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
| @@ -3232,7 +3232,7 @@ static int apply_binary_fragment(struct apply_state *state, | |||
| struct patch *patch) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:07AM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> `patch_delta()` takes the source and delta sizes by value and writes
> back the reconstructed target size through an `unsigned long *`. That
> datatype cannot represent a value that exceeds 4 GiB on systems where
> `unsigned long` is 32-bit (notably 64-bit Windows builds), though, even
> though the delta encoding itself, the on-disk layout, and the in-memory
> buffers happily carry such sizes. A `size_t` companion to
> `get_delta_hdr_size()`, `get_delta_hdr_size_sz()`, was introduced in
> 17fa077596 (delta, packfile: use size_t for delta header sizes,
> 2026-05-08) precisely so that `patch_delta()` could be widened without
> changing the on-the-wire decoding helper's signature.
>
> Widen `patch_delta()`'s three size parameters to `size_t` and switch
> its internal use of `get_delta_hdr_size()` to the `_sz` variant.
> Then propagate the wider type through the callers.
Does `get_delta_hdr_size()` have any remaining callers after this patch
series? I currently only spot two such callers, and you convert both of
them in this patch.
And can we reasonably add a test case that exercises this change?
> diff --git a/packfile.c b/packfile.c
> index 89366abfe3..e202f48837 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1964,10 +1964,8 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
> (uintmax_t)curpos, p->pack_name);
> data = NULL;
> } else {
> - unsigned long sz;
> data = patch_delta(base, base_size, delta_data,
> - delta_size, &sz);
> - size = sz;
> + delta_size, &size);
Nice that we get rid of this awkward construct.
Patrick|
User |
| @@ -453,7 +453,7 @@ static int check_pack_inflate(struct packed_git *p, | |||
| struct pack_window **w_curs, | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:08AM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> `write_reuse_object()` learned to track its packed-object size as
> `size_t` in 606c192380 (odb, packfile: use size_t for streaming
> object sizes, 2026-05-08), but the comparison sink it feeds,
> `check_pack_inflate()`, still takes the expected decompressed size
> as `unsigned long`. The call site bridges the mismatch with
> `cast_size_t_to_ulong()`, which on Windows turns a >4 GiB object
> into an immediate die().
>
> That function only uses `expect` once: as the right-hand side of a
> `stream.total_out == expect` equality test against zlib's counter.
> zlib's own `total_out` counter is `uLong` and is therefore still
> 32-bit-bound on Windows. Widening `expect` to `size_t` cannot fix that,
> but it is a strict improvement nonetheless: instead of dying outright,
> an oversized object now simply makes the equality fail and lets
> `write_reuse_object()` fall back to `write_no_reuse_object()`, which
> decompresses and re-deflates the content (and which the larger
> pack-objects widening series targets separately).
Hm. I wonder whether it's possible to reset `stream.total_out` on every
iteration and instead have a local `size_t` variable that we use to
track the total number of inflated bytes?
Patrick| @@ -1239,6 +1239,8 @@ static void *gfi_unpack_entry( | |||
| unsigned long *sizep) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:09AM +0000, Johannes Schindelin via GitGitGadget wrote:
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 82bc6dcc00..3dff898c43 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -1239,6 +1239,8 @@ static void *gfi_unpack_entry(
> unsigned long *sizep)
> {
> enum object_type type;
> + size_t size_st = 0;
> + void *data;
> struct packed_git *p = all_packs[oe->pack_id];
> if (p == pack_data && p->pack_size < (pack_size + the_hash_algo->rawsz)) {
> /* The object is stored in the packfile we are writing to
> @@ -1260,7 +1262,10 @@ static void *gfi_unpack_entry(
> */
> p->pack_size = pack_size + the_hash_algo->rawsz;
> }
> - return unpack_entry(the_repository, p, oe->idx.offset, &type, sizep);
> + data = unpack_entry(the_repository, p, oe->idx.offset, &type, &size_st);
> + if (sizep)
> + *sizep = cast_size_t_to_ulong(size_st);
> + return data;
> }
Nit, please feel free to ignore: do we want to add a NEEDSWORK comment
here?
Patrick| @@ -86,11 +86,8 @@ void *patch_delta(const void *src_buf, unsigned long src_size, | |||
| * This must be called twice on the delta data buffer, first to get the | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:11AM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> When I started the transition from `unsigned long` to `size_t`, in the
> interest of keeping the patches reviewable, I introduced these calls to
> prevent data type narrowing from silently failing to handle large object
> sizes. I also introduced `*_sz()` variants that would allow most of the
> callers to keep using that `unsigned long` that the 90s kindly asked to
> be returned.
>
> After the preceding commits, the only places that called the narrow
> wrappers either no longer exist or already use the `_sz` form
> internally, so the wrappers just narrow values back through
> `cast_size_t_to_ulong()` for no reason.
>
> Drop them and rename the `_sz` variants back to the natural names.
Aha, so you already address my comment I had on one of the preceding
patches :)
Patrick| @@ -3321,7 +3321,7 @@ static int apply_binary(struct apply_state *state, | |||
| if (odb_has_object(the_repository->objects, &oid, 0)) { | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:12AM +0000, Johannes Schindelin via GitGitGadget wrote:
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index fa45f774d7..fa6e396ddc 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -120,7 +120,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> struct object_id oid;
> enum object_type type;
> char *buf;
> - unsigned long size;
> + size_t size;
> struct object_context obj_context = {0};
> struct object_info oi = OBJECT_INFO_INIT;
> unsigned flags = OBJECT_INFO_LOOKUP_REPLACE;
> @@ -166,7 +166,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> if (use_mailmap && (type == OBJ_COMMIT || type == OBJ_TAG)) {
> size_t s = size;
> buf = replace_idents_using_mailmap(buf, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
>
> printf("%"PRIuMAX"\n", (uintmax_t)size);
Can't we drop this local variable completely and instead supply `&size`
directly?
> @@ -219,7 +225,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> if (use_mailmap) {
> size_t s = size;
> buf = replace_idents_using_mailmap(buf, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
>
> /* otherwise just spit out the data */
> @@ -266,7 +272,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> if (use_mailmap) {
> size_t s = size;
> buf = replace_idents_using_mailmap(buf, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
> break;
> }
> @@ -446,7 +455,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
> if (use_mailmap) {
> size_t s = size;
> contents = replace_idents_using_mailmap(contents, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
>
> if (type != data->type)
Likewise for these three instances.
> @@ -555,7 +564,7 @@ static void batch_object_write(const char *obj_name,
> if (!buf)
> die(_("unable to read %s"), oid_to_hex(&data->oid));
> buf = replace_idents_using_mailmap(buf, &s);
> - data->size = cast_size_t_to_ulong(s);
> + data->size = s;
>
> free(buf);
> }
And I think this site here can be adapted, as well.
> diff --git a/diff.c b/diff.c
> index 5a584fa1d5..816b89dc6c 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4594,8 +4594,9 @@ int diff_populate_filespec(struct repository *r,
> }
> }
> else {
> + size_t size_st = 0;
> struct object_info info = {
> - .sizep = &s->size
> + .sizep = &size_st
> };
>
> if (!(size_only || check_binary))
> @@ -4617,6 +4618,7 @@ int diff_populate_filespec(struct repository *r,
> die("unable to read %s", oid_to_hex(&s->oid));
>
> object_read:
> + s->size = cast_size_t_to_ulong(size_st);
> if (size_only || check_binary) {
> if (size_only)
> return 0;
> @@ -4631,6 +4633,7 @@ object_read:
> if (odb_read_object_info_extended(r->objects, &s->oid, &info,
> OBJECT_INFO_LOOKUP_REPLACE))
> die("unable to read %s", oid_to_hex(&s->oid));
> + s->size = cast_size_t_to_ulong(size_st);
> }
> s->should_free = 1;
> }
The flow in this function is quite weird if you ask me, but that's a
preexisting issue. This does look correct to me, even if it's awkward.
Patrick
This patch series tries to address the problems pointed out by the expensive tests that now run in CI: t5608 and t7508 verify various aspects about objects larger than 4GB, which Git does not currently handle correctly when run on a platform where
size_tis 64-bit andunsigned longis 32-bit.Unfortunately, this conflicts heavily with
ps/odb-source-loose. I rebased the branch ontoseenand pushed the result to https://github.com/dscho/git/tree/refs/heads/objects-larger-than-4gb-on-windows-pt2-seen, to make it easier to resolve merge conflicts. Here is the relevant range-diff:Cc: Signed-off-by: Kristofer Karlsson krka@spotify.com
cc: Patrick Steinhardt ps@pks.im